Collecting Natural SMS and Chat Conversations in Multiple Languages: The BOLT Phase 2 Corpus

Authors

  • Zhiyi Song
  • Stephanie Strassel
  • Haejoong Lee
  • Kevin Walker
  • Jonathan Wright
  • Jennifer Garland
  • Dana Fore
  • Brian Gainor
  • Preston Cabe
  • Thomas Thomas
  • Brendan Callahan
  • Ann Sawyer
Abstract

The DARPA BOLT Program develops systems capable of allowing English speakers to retrieve and understand information from informal foreign language sources. Phase 2 of the program required large volumes of naturally occurring informal text (SMS) and chat messages from individual users in multiple languages to support evaluation of machine translation systems. We describe the design and implementation of a robust collection system capable of capturing both live and archived SMS and chat conversations from willing participants. We also discuss the challenges of recruitment at a time when potential participants have acute and growing concerns about their personal privacy in the realm of digital communication, and we outline the techniques adopted to confront those challenges. Finally, we review the properties of the resulting BOLT Phase 2 Corpus, which comprises over 6.5 million words of naturally occurring chat and SMS in English, Chinese and Egyptian Arabic.


Similar resources

Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus

This paper describes the process of creating a novel resource, a parallel Arabizi-Arabic script corpus of SMS/Chat data. The language used in social media expresses many differences from other written genres: its vocabulary is informal with intentional deviations from standard orthography such as repeated letters for emphasis; typos and nonstandard abbreviations are common; and nonlinguistic co...


Language and the Socio-Cultural Worlds of Those Who Use it: A Case of Vague Expressions

 The present study is an attempt to investigate the use of vague expressions by intermediate EFL learners. More specifically, the current study focuses on the structures and functions of one of the most common categories of vague language, i.e. general extenders. The data include a 22-hour corpus of English-as-a-foreign-language conversations. A comparison is also made between this corpus and a...


You Talking to Me? A Corpus and Algorithm for Conversation Disentanglement

When multiple conversations occur simultaneously, a listener must decide which conversation each utterance is part of in order to interpret and respond to it appropriately. We refer to this task as disentanglement. We present a corpus of Internet Relay Chat (IRC) dialogue in which the various conversations have been manually disentangled, and evaluate annotator reliability. This is, to our know...


MPC: A Multi-Party Chat Corpus for Modeling Social Phenomena in Discourse

In this paper, we describe our experience with collecting and creating an annotated corpus of multi-party online conversations in a chat-room environment. This effort is part of a larger project to develop computational models of social phenomena such as agenda control, influence, and leadership in on-line interactions. Such models will help capture the dialogue dynamics that are essential fo...


The Query of Everything: Developing Open-Domain, Natural-Language Queries for BOLT Information Retrieval

The DARPA BOLT Information Retrieval evaluations target open-domain natural-language queries over a large corpus of informal text in English, Chinese and Egyptian Arabic. We outline the goals of BOLT IR, comparing it with the prior GALE Distillation task. After discussing the properties of the BOLT IR corpus, we provide a detailed description of the query creation process, contrasting the summa...




Publication date: 2014